Patch cuda dev tools (#736) #743

Merged
merged 11 commits into from
Jan 23, 2024
Conversation


@cboettig cboettig commented Jan 7, 2024

Just a draft proposal here so far, but I think it's the least-friction way to do this. It doesn't impact the normal r-ver stack. In the cuda stack, the idea would be to copy the compiled version of R from r-ver rather than run the typical install_R_source.sh script, which creates problems when purging dependencies. To make this work, the runtime dependencies still need to be installed, so for simplicity I've added these in another script. (Technically that means we have these dependencies specified in two places now, so I could adjust install_R_source.sh to source the new script instead; not sure if that's preferable?)

I haven't taken a go at defining the new bakefile template yet. Hopefully it's straightforward, though I will need to make sure it gets the right version tag for r-ver images in the non-latest images; I haven't thought out the logic there.

@eitsupi thoughts?


eitsupi commented Jan 7, 2024

Thanks for working on this!
I think this is a good direction, but looking at the original script, wouldn't it be cleaner and easier to add an option that skips the part that builds R?

## Download R from 0-Cloud CRAN mirror or CRAN
function download_r_src() {
    wget "https://cloud.r-project.org/src/$1" -O "R.tar.gz" ||
        wget "https://cran.r-project.org/src/$1" -O "R.tar.gz"
}

if [ "$R_VERSION" == "devel" ]; then
    download_r_src "base-prerelease/R-devel.tar.gz"
elif [ "$R_VERSION" == "patched" ]; then
    download_r_src "base-prerelease/R-latest.tar.gz"
elif [ "$R_VERSION" == "latest" ]; then
    download_r_src "base/R-latest.tar.gz"
else
    download_r_src "base/R-${R_VERSION%%.*}/R-${R_VERSION}.tar.gz"
fi

tar xzf "R.tar.gz"
cd R-*/

R_PAPERSIZE=letter \
    R_BATCHSAVE="--no-save --no-restore" \
    R_BROWSER=xdg-open \
    PAGER=/usr/bin/pager \
    PERL=/usr/bin/perl \
    R_UNZIPCMD=/usr/bin/unzip \
    R_ZIPCMD=/usr/bin/zip \
    R_PRINTCMD=/usr/bin/lpr \
    LIBnn=lib \
    AWK=/usr/bin/awk \
    CFLAGS="-g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g" \
    CXXFLAGS="-g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g" \
    ./configure --enable-R-shlib \
    --enable-memory-profiling \
    --with-readline \
    --with-blas \
    --with-lapack \
    --with-tcltk \
    --with-recommended-packages
make
make install
make clean
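One way such a skip option might look is sketched below. This is only an illustration: the flag name `SKIP_BUILD_R` and the stub functions are assumptions, not part of the actual install_R_source.sh.

```shell
#!/bin/sh
# Hypothetical sketch: wrap the expensive build steps in a flag so the
# cuda stack can reuse the same script without rebuilding R.
# SKIP_BUILD_R and both functions are assumed names, not from the script.
SKIP_BUILD_R=${SKIP_BUILD_R:-"false"}

install_runtime_deps() { echo "installing runtime deps"; }   # stub
build_r_from_source() { echo "downloading and building R"; } # stub

install_runtime_deps
if [ "$SKIP_BUILD_R" != "true" ]; then
    build_r_from_source
fi
```

With `SKIP_BUILD_R=true` in the environment, only the runtime-dependency step would run.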


cboettig commented Jan 7, 2024

Thanks @eitsupi ! Yes, great point, that does sound like a much simpler solution! I'll scratch this and take a go at that instead.

@cboettig cboettig marked this pull request as ready for review January 16, 2024 19:40
@cboettig (Member Author)

@eitsupi Okay, looping back to this using your simpler mechanism. Can you take a look over this when you get a chance? Thanks!


eitsupi commented Jan 17, 2024

I'm sorry my comment was confusing.
What I was trying to say was that rather than adding a new script, it would be better to use the same script, with a flag added to install_R_source.sh that skips the "build R" steps, so that we don't have to maintain almost identical copies in two files.

@@ -157,7 +158,9 @@ rm -rf "R.tar.gz"
 cp /usr/bin/checkbashisms /usr/local/bin/checkbashisms

 # shellcheck disable=SC2086
-apt-get remove --purge -y ${BUILDDEPS}
+if [ "${PURGE_BUILDDEPS}" == "TRUE" ]; then
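The change in this hunk amounts to guarding the purge step behind a flag. A minimal runnable sketch of that pattern, with the real `apt-get remove --purge -y ${BUILDDEPS}` call stubbed out as an echo and an illustrative package list:

```shell
#!/bin/sh
# Sketch of the PURGE_BUILDDEPS guard; the real script runs
# `apt-get remove --purge -y ${BUILDDEPS}` where this echoes.
PURGE_BUILDDEPS=${PURGE_BUILDDEPS:-"true"}
BUILDDEPS="gcc g++ make"   # illustrative subset, not the real list

if [ "$PURGE_BUILDDEPS" = "true" ]; then
    echo "purging: $BUILDDEPS"
else
    echo "keeping build dependencies installed"
fi
```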
cboettig (Member Author):

@eitsupi you said:

it would be better to use the same script with the flag added to install_R_source.sh and skipped the "build R" steps so that we don't have to have almost identical copies in two files.

I am adding a flag to install_R_source.sh right here, though. What two files are you referring to as almost identical copies?

@eitsupi eitsupi Jan 17, 2024
Perhaps our assumptions didn't line up. I had imagined something like the following:

FROM rocker/r-ver:4.3.2 as builder

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
COPY --from=builder /foo/bar
COPY scripts/install_R_source.sh /rocker_scripts/install_R_source.sh

# should install the necessary packages, but should skip building R
ENV SKIP_BUILD_R="true"
RUN /rocker_scripts/install_R_source.sh

cboettig (Member Author):

Thanks, I see what you mean. That looks rather more complicated, though:

  • It requires a multi-stage build, and I'm not sure how easy that is to add to our existing build stack.
  • Copying /foo/bar isn't so clean, because our standard build installs R_HOME into /usr/local, with binaries in /usr/local/bin, etc., rather than freestanding. Yes, I proposed a solution around this in my previous proposal, but it is a significant change that could break some user expectations.
  • SKIP_BUILD_R would have to wrap around the install-time deps, the build of R, and the cleanup and purge of dependencies. That's fine, but it feels more natural to me to have an option that just avoids the apt remove step. After all, given that apt remove has unexpected consequences, it seems reasonable to be able to opt out of that without opting out of the rest. And intuitively, I think it's easier to guess how install_R_source.sh behaves in response to an option of PURGE_BUILDDEPS=FALSE than to guess what it does in response to an option of SKIP_BUILD_R.

What's the downside of the approach here? I still don't see what's been duplicated.

@cboettig (Member Author)

Apologies about my lint errors; it looks like a whitespace issue, but I'm not sure.

Happy to change strategies again if that's better. I think adding the option of simply turning off the uninstall of builddeps is the least invasive way to address this issue with minimal changes. Please let me know if there's a better alternative, but I think I'd rather avoid introducing staged builds merely to address the problems created by the apt-remove step. (If we're going to introduce support for staged builds, I think it makes most sense to do so in a refactor of r-ver, where that is the natural mechanism in modern Docker to accomplish what we do via the apt remove line, as you know.)

I did slip a few other changes in here that look like they got missed in #740 (that PR updated the necessary env vars for most, but not all, recipes that use the revised install_python.sh script, and I think we want that on cuda images too). This also corrects a small permissions issue in install_python.sh. Let me know if it would be better to do these bits as a separate PR; I just thought it made sense to fix them while I was testing these cuda images anyhow.

Thanks for the review and suggestions!

@@ -258,7 +258,7 @@
 "PYTHON_CONFIGURE_OPTS": "--enable-shared",
 "RETICULATE_AUTOCONFIGURE": "0",
 "PURGE_BUILDDEPS": "FALSE",
-"VIRUTAL_ENV": "/opt/venv",
+"VIRTUAL_ENV": "/opt/venv",
Member:

Hah!

cboettig (Member Author):

Indeed! It will be nice to have /opt/venv work on these images.


eitsupi commented Jan 18, 2024

@cboettig I understood. Indeed, I think this simple solution can be implemented relatively quickly and is worthwhile.

By the way, the Lint error looks like an indent size issue.

@@ -11,6 +11,7 @@
 set -e

 R_VERSION=${1:-${R_VERSION:-"latest"}}
+PURGE_BUILDDEPS=${PURGE_BUILDDEPS=-"TRUE"}
Member:

I think bools in env vars should be "true" or "false", not "TRUE" or "FALSE".

Member:

I think env vars are untyped so 1 and "1" and "TRUE" and "True" and "true" are all the same. Unless, of course, the receiving end makes a difference when it parses. Otherwise the check is commonly 'set or not set'.

Member:

Is that a common practice in shell script?
I saw in bash:

$ if "true"; then echo "true"; fi
true

$ if "false"; then echo "true"; fi

$ if "1"; then echo "true"; fi
bash: 1: command not found

$ if "TRUE"; then echo "true"; fi
bash: TRUE: command not found

$ if "True"; then echo "true"; fi
bash: True: command not found

cboettig (Member Author):

Right, but as you see, I did this as a string comparison, since booleans in bash always confuse me. Anyway, I've changed these to lower-case now. Thanks for catching this! Let me know if you'd recommend further changes here.
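The failure mode shown in the bash session above is that `if "$FLAG"` executes the value as a command, so anything other than the literal commands `true` or `false` errors out. Comparing the value as a string works for any spelling; a minimal sketch:

```shell
#!/bin/sh
# String comparison treats the env var as data, never as a command,
# so "TRUE"/"True"/"true" can all be handled explicitly.
FLAG="TRUE"
if [ "$FLAG" = "TRUE" ]; then
    result="matched"
else
    result="no match"
fi
echo "$result"
```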

@@ -148,15 +148,17 @@
 "org.opencontainers.image.title": "rocker/cuda",
 "org.opencontainers.image.description": "NVIDIA CUDA libraries added to Rocker image."
 },
-"FROM": "nvidia/cuda:11.8.0-cudnn8-devel-ubuntu24.04",
+"FROM": "nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04",
Member:

If I remember correctly, this value is updated by the script, so don't update it here manually.

Member:

I wondered about this when the script added it. Are we sure that (a) the production cycle for Ubuntu 24.04 has started and (b) NVidia already publishes to it? I was under the impression NVidia was particularly 'laggy', i.e. the gcc version implied by nvcc was always ages behind.

Contributor:

There is no nvidia/cuda:11.8.0-cudnn8-devel-ubuntu24.04. There are

  • nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
  • nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
  • nvidia/cuda:11.8.0-cudnn8-devel-ubuntu18.04 (EOL)

See https://gitlab.com/nvidia/container-images/cuda/-/tree/master/dist/11.8.0

cboettig (Member Author):

Thanks all. @eitsupi can you clarify? If this field is not read because it is provided by the script, then maybe we can drop it or stick in a dummy value? It seems needlessly confusing to reference a value that isn't going to be used and isn't actually a valid image name, but just happens to look almost like one, right?

@eitsupi eitsupi Jan 19, 2024

What I meant is not that it is "not read", but rather "don't set it manually, because if the script is executed by CI, it will be updated".

The priority here is to use the same Ubuntu version as rocker/r-ver. I haven't found an easy way to check if that image actually exists (This is definitely possible using buildx, but I couldn't figure out how to check without logging into the Container Registry.), so no such processing is implemented.
If you try to build it, it will simply fail.

Currently pointing to Ubuntu 24.04 is not as intended; I really wanted it to be 22.04, which is the same as the current ubuntu:latest. In other words, this is a bug in the script. (The 24.04 release date is later than today, so it should be filtered out.)

"FROM": "ubuntu:latest",

# Update the cuda base image
template$stack[[9]]$FROM <- .cuda_baseimage_tag(dplyr::last(df_ubuntu_lts$ubuntu_series))

cboettig (Member Author):

right, so it now reads 22.04, and I think should work as intended. let me know if anything needs to be changed?

Member:

right, so it now reads 22.04, and I think should work as intended. let me know if anything needs to be changed?

Run the scripts to update the files. It should be back.

## Automatic update of container definition files
The Dockerfiles for pre-build images will be generated by executing the scripts in the following order. It will also generate the `docker-bake.json` files that can be used to build the container images. `make-matrix.R` will generate files containing the matrix used in GitHub Actions.
1. `./build/make-stacks.R`
2. `./build/make-dockerfiles.R`
3. `./build/make-bakejson.R`
4. `./build/make-matrix.R`

cboettig (Member Author):

@eitsupi ok, I see what you mean: make-stacks.R tries to figure this out automatically and comes up with the nonsense answer of 24.04 on the devel image. I still think that's confusing (doesn't that mean we're just failing to build a devel flavor for the whole cuda stack?), but I agree it's a separate issue and not one that needs to be solved in this PR.

I've just pushed an edit to roll that back.

I thought I fixed whitespace but am still struggling with the linter :-(

@cboettig (Member Author)

@eitsupi any further thoughts or edits you'd like to see here before we merge?


eitsupi commented Jan 22, 2024

Don't you need to edit other stack files?

Don't forget to also run other build scripts.

The linter error indicates that the number of indentations is different. (The indent size in other places is 4, but the line you added is 2)

@cboettig (Member Author)

Thanks @eitsupi ! I've fixed the whitespace issue with linting now and run the remaining scripts.

I'd like to at least start with these changes only in the active images (devel and latest, though it looks like the scripts automatically back-port this to the most recent locked release as well). Traditionally we have allowed images to continue to evolve at the front end while trying to keep old versions as stable and locked as possible, at least while still working out features. In some cases (maybe including this one) we might want to port this to all 4.x releases, but maybe we can revisit that once this is merged and working?

  • do we want to run the build scripts manually in the commit as well? would it make sense to have that task performed by CI?
  • if running these four build scripts is our intended workflow for PRs, (or even for testing PRs), I wonder if it would not make sense to have them bundled under a Makefile directive for that?
  • we've moved to a more portable linter, right? Might be nice to have a make lint option.


eitsupi commented Jan 23, 2024

  • do we want to run the build scripts manually in the commit as well? would it make sense to have that task performed by CI?
  • if running these four build scripts is our intended workflow for PRs, (or even for testing PRs), I wonder if it would not make sense to have them bundled under a Makefile directive for that?

As you are aware, these scripts run automatically once a day on the main branch, so in many cases there is no problem even if no changes are made on the PR.
However, I was concerned that a merge without recognizing this might cause unexpected changes.


@eitsupi eitsupi left a comment


I'd like to at least start with these changes only in the active images (devel and latest, though looks like the scripts automatically back-port this to the most recent locked release as well). Traditionally we have allowed images to continue to evolve at the front end while trying to keep old versions as stable and locked as possible, at least while still working out features. In some cases (maybe including this one) we might want to port this to all 4.x releases, but maybe we can revisit that once this is merged and working?

Makes sense. Let's try merging.


eitsupi commented Jan 23, 2024

I don't know what the CI failures indicate - TeX related stuff breaks a lot, so I don't know if this is temporary or not. I don't think it has anything to do with this PR.

@cboettig (Member Author)

The TeX error confuses me too. Like you say, it's always a fragile point. Locally that test has no problems for me.

@cboettig cboettig merged commit 96f7241 into master Jan 23, 2024
47 of 48 checks passed
@cboettig cboettig deleted the patch/cuda branch January 23, 2024 19:16
@@ -11,6 +11,7 @@
 set -e

 R_VERSION=${1:-${R_VERSION:-"latest"}}
+PURGE_BUILDDEPS=${PURGE_BUILDDEPS=-"true"}
Member:

Maybe typo?
PURGE_BUILDDEPS=${PURGE_BUILDDEPS:-"true"}
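The difference between the two expansions can be checked directly in a shell. With `=-` the expansion is parsed as `${VAR=word}` where the word is `-"true"`, so an unset variable silently gets the literal value `-true`; `:-` substitutes the intended default. A quick demonstration (variable names illustrative):

```shell
#!/bin/sh
# ${A=-"true"} is ${A=word} with word -"true": assigns -true to unset A.
# ${B:-"true"} substitutes true without assigning anything to B.
unset A B
typo="${A=-"true"}"     # assigns and expands to -true
fixed="${B:-"true"}"    # expands to true; B stays unset
echo "typo=$typo fixed=$fixed"
```

This is why the `=-` version makes a later `[ "$PURGE_BUILDDEPS" = "true" ]` check fail by default.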

eitsupi added a commit that referenced this pull request Jan 24, 2024

eitsupi commented Jan 24, 2024

See #750 and #751.

This PR breaks a lot of things...

eitsupi added a commit that referenced this pull request Jan 24, 2024
cboettig added a commit that referenced this pull request Jan 25, 2024
@eitsupi my apologies, I didn't realize my local test was pulling the
upstream image. The TeX issue is specific to the changes in #743:
the official Ubuntu repo texlive binaries are installed and then purged
via the BUILDDEPS list, so by not purging builddeps we ended up with
conflicting versions of texlive.

This PR modifies `install_texlive.sh` to skip the manual install
of tlmgr from CTAN if texlive has already been installed by apt-get
(i.e. `/usr/bin/latex` already exists). I believe it should resolve the
issue.
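The described fix reduces to an early-exit guard at the top of the manual install path. A hedged sketch, assuming (as the commit message states) that the apt install leaves a binary at /usr/bin/latex; the variable and messages here are illustrative, not the actual script contents:

```shell
#!/bin/sh
# Sketch of the install_texlive.sh guard described above: skip the
# manual tlmgr/CTAN install when apt-get already provided TeX Live.
if [ -x /usr/bin/latex ]; then
    tex_install="skip-manual"   # apt-provided TeX Live already present
else
    tex_install="manual"        # proceed with the manual install
fi
echo "texlive install mode: $tex_install"
```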